An incident where a k8s node went NotReady


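Nodes in our test-environment k8s cluster kept going down and recovering on their own, which affected the systems running in the pods on them. The events were as follows: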

The error Image garbage collection failed: non-existent label "docker-images" suggests that something like a docker images label does not exist.

So I checked the status and logs of kubelet and docker with the following commands:

systemctl status docker -l

systemctl status kubelet -l

[root@node-108 ~]# systemctl status kubelet -l
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since 三 2023-03-15 16:29:34 CST; 30min ago
[root@node-108 ~]# systemctl status docker -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─docker-dns.conf, docker-options.conf, docker-orphan-cleanup.conf
   Active: active (running) since 三 2023-03-15 16:29:34 CST; 30min ago

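Both kubelet and docker had been restarted: each shows Active since 16:29:34. Given the error Image garbage collection failed: non-existent label "docker-images", my guess was that docker had restarted, so kubelet reported that it could not find the docker images label.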

To find out how often docker was restarting, I set up a cron job:

[root@node-108 ~]# crontab -l
*/1 * * * * echo `date` `systemctl status docker |grep Active` >> /home/dockerstatus.log
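The output of systemctl status is meant for people and changes with the system locale. As a rough sketch, the same sampling could be done with systemctl show, which prints KEY=VALUE lines. The function name log_unit_state and its two parameters are hypothetical and not part of the original setup.

```shell
#!/usr/bin/env bash

# log_unit_state UNIT LOGFILE: append one timestamped sample of a systemd
# unit's state to LOGFILE. Both arguments are illustrative parameters.
log_unit_state() {
    local unit=$1 logfile=$2
    local state
    # `systemctl show` prints one KEY=VALUE pair per line; join them into one line.
    state=$(systemctl show "$unit" --property=ActiveState,SubState,ActiveEnterTimestamp | tr '\n' ' ')
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$state" >> "$logfile"
}
```

Saved in a small script, it could be called once a minute from cron, for example as log_unit_state docker /home/dockerstatus.log.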

/home/dockerstatus.log shows that docker restarted between 16:21 and 16:24, and that the cron job did not run at 16:22 or 16:23.

2023年 03月 15日星期三 16:21:01 CST Active: active (running) since 三 2023-03-15 16:02:18 CST; 18min ago
2023年 03月 15日星期三 16:24:01 CST Active: activating (start) since 三 2023-03-15 16:23:21 CST; 40s ago
2023年 03月 15日星期三 16:25:01 CST Active: deactivating (stop-sigterm) (Result: timeout)
2023年 03月 15日星期三 16:26:01 CST Active: activating (start) since 三 2023-03-15 16:25:51 CST; 9s ago
2023年 03月 15日星期三 16:27:01 CST Active: deactivating (stop-sigterm) (Result: timeout)
2023年 03月 15日星期三 16:28:01 CST Active: deactivating (stop-sigterm) (Result: timeout)
2023年 03月 15日星期三 16:29:01 CST Active: activating (start) since 三 2023-03-15 16:28:22 CST; 39s ago
2023年 03月 15日星期三 16:30:01 CST Active: active (running) since 三 2023-03-15 16:29:34 CST; 27s ago
2023年 03月 15日星期三 16:31:01 CST Active: active (running) since 三 2023-03-15 16:29:34 CST; 1min 27s ago
2023年 03月 15日星期三 16:32:01 CST Active: active (running) since 三 2023-03-15 16:29:34 CST; 2min 27s ago
2023年 03月 15日星期三 16:33:01 CST Active: active (running) since 三 2023-03-15 16:29:34 CST; 3min 27s ago
2023年 03月 15日星期三 16:34:01 CST Active: active (running) since 三 2023-03-15 16:29:34 CST; 4min 27s ago
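Missing samples like the ones at 16:22 and 16:23 are easy to overlook in a longer log. The sketch below flags them automatically. It assumes one sample per line, the time of day in the 4th whitespace-separated field as in the log above, and a log that does not cross midnight. The name find_gaps and the 90-second default threshold are hypothetical.

```shell
#!/usr/bin/env bash

# find_gaps LOGFILE [MAX_GAP]: report consecutive samples that are more than
# MAX_GAP seconds apart (default 90, i.e. at least one missed one-minute run).
find_gaps() {
    local logfile=$1 max_gap=${2:-90}
    awk -v max="$max_gap" '
        {
            # Field 4 is HH:MM:SS; convert it to seconds since midnight.
            split($4, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR > 1 && now - prev > max)
                printf "gap of %d s between %s and %s\n", now - prev, prev_str, $4
            prev = now
            prev_str = $4
        }
    ' "$logfile"
}
```

Running find_gaps /home/dockerstatus.log on a log in this format prints one line per gap and nothing when every sample is present.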

So I ran who -b to check when the system last booted, and found that the host had rebooted by itself at 16:22.

[root@node-108 ~]# who -b
系统引导 2023-03-15 16:22
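The same check can be scripted by comparing the kernel boot time with the time of the last healthy sample. This is only a sketch for Linux hosts: it reads the btime line from /proc/stat, and the function names boot_epoch and rebooted_since are hypothetical.

```shell
#!/usr/bin/env bash

# boot_epoch: kernel boot time in seconds since the Unix epoch (Linux only).
boot_epoch() {
    awk '/^btime/ {print $2}' /proc/stat
}

# rebooted_since SINCE [BOOT]: succeed if the boot time is later than SINCE.
# Both values are epoch seconds; BOOT defaults to this host's boot time.
rebooted_since() {
    local since=$1
    local boot=${2:-$(boot_epoch)}
    [ "$boot" -gt "$since" ]
}
```

With GNU date, the last healthy sample can be converted to epoch seconds first, for example rebooted_since "$(date -d '2023-03-15 16:21:01' +%s)" && echo "host rebooted after the last healthy sample".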

The host itself is probably unstable, so for now all I could do was mark the node as unschedulable.

[root@master-101 ~]# kubectl cordon node-108
node/node-108 cordoned
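Note that cordon only stops new pods from being scheduled onto the node; pods already running there stay put. If they should move to other nodes as well, a drain is the usual next step. The following is a sketch rather than what was done in this incident. The function name quarantine_node and the KUBECTL override variable are hypothetical, and --delete-emptydir-data needs a reasonably recent kubectl.

```shell
#!/usr/bin/env bash

# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL=${KUBECTL:-kubectl}

# quarantine_node NODE: mark NODE unschedulable, then evict its pods.
# DaemonSet pods are left in place; emptyDir data on evicted pods is lost.
quarantine_node() {
    local node=$1
    "$KUBECTL" cordon "$node" &&
        "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data
}
```

Once the host is stable again, kubectl uncordon node-108 makes the node schedulable again.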
